AITopics | tomek link

Foundations of data imbalance and solutions for a data democracy

Kulkarni, Ajay, Chong, Deri, Batarseh, Feras A.

arXiv.org Artificial Intelligence, Jul-30-2021

Dealing with imbalanced data is a prevalent problem when performing classification on datasets. This problem often contributes to bias when making decisions or implementing policies. Thus, it is vital to understand the factors that cause imbalance in the data (or class imbalance). Such hidden biases and imbalances can lead to data tyranny and pose a major challenge to a data democracy. In this chapter, two essential statistical elements are resolved: the degree of class imbalance and the complexity of the concept. Solving such issues helps in building the foundations of a data democracy. Further, statistical measures that are appropriate in these scenarios are discussed and implemented on a real-life dataset (car insurance claims). Finally, popular data-level methods such as Random Oversampling, Random Undersampling, SMOTE, Tomek Link, and others are implemented in Python, and their performance is compared.

Keywords - Imbalanced Data, Degree of Class Imbalance, Complexity of the Concept, Statistical Assessment Metrics, Undersampling and Oversampling

1. Motivation & Introduction

In the real world, data are collected from various sources such as social networks, websites, logs, and databases. When dealing with data from different sources, it is crucial to check the quality of the data [1]. Data of questionable quality can introduce different types of biases at various stages of the data science lifecycle. These biases can sometimes affect the association between variables, and in many cases could represent the opposite of the actual behavior [2].
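The abstract above names Tomek Link among the data-level methods. As a rough illustration only (this is not the chapter's code), a Tomek link can be sketched as a pair of points from different classes that are each other's nearest neighbor; the usual undersampling step then drops the majority-class member of each pair. The function names, the toy points, and the choice of Euclidean distance are all assumptions made for this sketch.

```python
import math

def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    best_j, best_d = None, math.inf
    for j, q in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], q)  # Euclidean distance (an assumption)
        if d < best_d:
            best_j, best_d = j, d
    return best_j

def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

def remove_majority_in_links(points, labels, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(points, labels)
            for k in pair if labels[k] == majority_label}
    kept = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in kept], [labels[k] for k in kept]

# Hypothetical toy data: class 0 is the majority (4 points), class 1 the minority (3).
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 1.0),
          (3.0, 3.0), (3.2, 3.1), (-0.3, 0.0)]
labels = [0, 0, 0, 1, 1, 1, 0]

links = tomek_links(points, labels)
cleaned_points, cleaned_labels = remove_majority_in_links(points, labels, majority_label=0)
print(links)
print(cleaned_labels)
```

The brute-force neighbor search is quadratic in the number of points, which is fine for a toy example; a library implementation would normally use a spatial index instead.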

dataset, imbalance, porto seguro, (13 more...)

arXiv.org Artificial Intelligence: 2108.00071

Genre: Research Report (1.00)

Industry:

Banking & Finance > Insurance (0.68)
Information Technology (0.48)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Stop using SMOTE to handle all your Imbalanced Data

#artificialintelligence, May-2-2021, 22:10:07 GMT

In classification tasks, one may encounter a situation where the target class labels are not equally distributed. Such a dataset is termed imbalanced. Imbalance in the data can hinder the training of a data science model. In class imbalance problems, the model is trained mainly on the majority class and becomes biased towards predicting it. Hence, handling class imbalance is essential before proceeding to the modeling pipeline.
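Before choosing a remedy, it can help to quantify how unequal the label distribution is. The following is a minimal sketch, assuming a plain list of labels; the helper names and the 95/5 split are hypothetical and not taken from the article.

```python
from collections import Counter

def class_distribution(labels):
    """Fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical labels: 95 negative samples and 5 positive samples.
labels = ["neg"] * 95 + ["pos"] * 5

print(class_distribution(labels))
print(imbalance_ratio(labels))
```

A ratio close to 1 indicates a roughly balanced dataset; the larger the ratio, the more a model trained without correction is likely to favor the majority class.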

boundary, smote, tomek link, (16 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.31)

Undersampling Algorithms for Imbalanced Classification

#artificialintelligence, Jan-20-2020, 11:35:01 GMT

Taken from Improving Identification of Difficult Small Classes by Balancing Class Distribution. This technique can be implemented using the NeighbourhoodCleaningRule class from imbalanced-learn. The number of neighbors used in the ENN and CNN steps can be specified via the n_neighbors argument, which defaults to three. The threshold_cleaning argument controls whether or not the CNN is applied to a given class, which might be useful if there are multiple minority classes with similar sizes. Here it is kept at 0.5.

dataset, majority class, minority class, (13 more...)

#artificialintelligence

Genre: Instructional Material (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Imbalanced Datasets

@machinelearnbot, May-17-2017, 09:35:11 GMT

Imagine you are a medical professional who is training a classifier to detect whether an individual has an extremely rare disease. You train your classifier, and it yields 99.9% accuracy on your test set. You're overjoyed by these results, but when you check the labels output by the classifier, you see it always output "No Disease," regardless of the patient data. Because the disease is extremely rare, there were only a handful of patients with the disease in your dataset compared to the thousands of patients without the disease. Because over 99.9% of the patients in your dataset don't have the disease, any classifier can achieve an impressively high accuracy simply by returning "No Disease" for every new patient.
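The scenario above can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical counts (5 patients with the disease and 4,995 without) chosen to match the 99.9% figure; the helper functions are illustrative and not from the article.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of truly positive cases that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Hypothetical test set: 5 patients with the disease, 4995 without.
y_true = ["Disease"] * 5 + ["No Disease"] * 4995

# A "classifier" that ignores its input and always answers "No Disease".
y_pred = ["No Disease"] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred, positive="Disease")
print(f"accuracy: {acc:.3f}")
print(f"recall on 'Disease': {rec:.3f}")
```

Accuracy looks excellent because it is dominated by the majority class, while recall on the rare class shows that not a single diseased patient was found.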

artificial intelligence, dataset, machine learning, (18 more...)

@machinelearnbot

Country: North America > United States > District of Columbia (0.05)

Industry: Health & Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.31)